---
layout: page
title: Data Science Master's Thesis
htmlwidgets: TRUE
permalink: /predictive-analytics-thesis/
---
Welcome to my spot on the web for drafts, supplemental material, and general thoughts about doing a thesis project for the Master of Science in Predictive Analytics degree (now the Master's in Data Science (MSDS) program) from Northwestern University. Below the interactive plots, I'm developing a sort of "epilogue" containing thoughts about doing a data science Master's, choosing the thesis option, and some of the things I've learned along the way.
I'll update this section with drafts as they get finished.
2018-11-04: I have a (mostly) completed draft you can check out here on Google Drive. I'm currently awaiting comments from readers, so no doubt it will change substantially. I haven't put in a Table of Contents and I'm still figuring out how to list the supplemental materials you'll find on this page, but everything else is there (hooray!).
2018-12-16: A lot has changed in the last month or so! I've decided to push back my tentative graduation date from this month to the end of the 2019 Winter Quarter, in part due to starting a new position as a data scientist at Highmark Health here in Pittsburgh. I had the thesis draft reviewed by my first reader, who suggested some restructuring for the Conclusions section but otherwise found it to be good.
I spent a few weeks away from the thesis, which allowed me to come back to it with a fresh set of eyes. I made some grammar edits and added the Table of Contents as well as the Appendix listing the supplemental material (links to the GitHub repo and this webpage). The most recent version is v.4.0, which can be accessed here. This is a completely formatted draft with all the necessary components as outlined in the Graduate Thesis Handbook.
I'm happy to have some time to finish the process in a way that isn't rushed. I'll be working over the holidays to restructure the Conclusions section and hope to get notes from a second reader by the end of January. Barring any substantial unforeseen issues, I should have everything done by the March 15th deadline to graduate at the end of the Winter 2019 quarter (hooray!).
2019-04-06: It's been a minute but, yes, I finished the degree. As of March 29, 2019, I'm an official graduate of Northwestern's MSPA program. I'm one of the last of the "old guard," as the program has changed significantly from when I started it back in 2016 (including in name, as it is now the Master of Science in Data Science).
A copy of the final accepted version of my thesis can be found here.
Many thanks to my readers, Drs. Alianna Maren and Lawrence Fulton, to all my professors, and to all my classmates from whom I learned so much.
I've added to the list of things I've learned at the bottom of this page. Please don't hesitate to reach out using the links at the bottom of the page if you have any questions about doing a data science master's degree, choosing between a capstone and a thesis, or anything else. If you find a broken link here, please let me know 😃
Good luck to everyone on a data science journey!
All the code (mostly in R) for the thesis can be found in the project repo on GitHub.
Below are four interactive multidimensional scaling (MDS) plots of genetic profiles developed from open-source RNA-seq data available from the Aging, Dementia, and TBI Study from the Allen Institute for Brain Science.
Use your mouse to grab them, rotate them, and zoom in and out. Hovering over a data point gives the point's coordinates in the first three MDS dimensions. Each point represents a genetic profile (based on expression levels for 50,000+ genes and gene isoforms) for an individual patient/donor.
These were made using Plotly and htmlwidgets for R. Check out this blog post for more on multidimensional scaling of gene expression level data.
HIP = hippocampus
FWM = forebrain white matter
PCx = parietal cortex
TCx = temporal cortex
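For a rough idea of what multidimensional scaling does with profiles like these, here's a minimal sketch of classical (Torgerson) MDS in plain Python. It is not the thesis code (that's in R and may use a different MDS variant). The toy profiles and the function names `pairwise_distances`, `double_center`, `top_eigenpairs`, and `classical_mds` are made up for illustration, and the power-iteration eigen-solver is just a stand-in to keep the example dependency-free.

```python
import math
import random

# Synthetic toy data, NOT real expression values: 8 "donors" x 3 "genes".
toy_profiles = [[x, y, z] for x in (0.0, 10.0) for y in (0.0, 4.0) for z in (0.0, 1.0)]

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    return [[math.dist(p, q) for q in points] for p in points]

def double_center(dist):
    """Turn a symmetric distance matrix into the centered matrix B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpairs(matrix, k, iters=500, seed=0):
    """Largest k eigenpairs of a symmetric PSD matrix via power iteration with deflation."""
    rng = random.Random(seed)
    n = len(matrix)
    b = [row[:] for row in matrix]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next one.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return pairs

def classical_mds(points, k=3):
    """Coordinates of each point in the first k MDS dimensions."""
    b = double_center(pairwise_distances(points))
    coords = [[0.0] * k for _ in points]
    for dim, (lam, v) in enumerate(top_eigenpairs(b, k)):
        scale = math.sqrt(max(lam, 0.0))
        for i, x in enumerate(v):
            coords[i][dim] = scale * x
    return coords

coords = classical_mds(toy_profiles, k=3)
```

Each row of `coords` holds one toy profile's position in the first three MDS dimensions, the same kind of triple shown when hovering over a point in the plots. Real profiles have far more than three features, so the first three dimensions only approximate the distances between them.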
A comparison of the numbers of "significant" genes obtained with different filtering parameters and p-value cutoffs for determining differential expression in donors with dementia.
Filtering & P-Value Cutoff Experiment Spreadsheet
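The bookkeeping behind a comparison like this comes down to counting the genes that pass a filter and a p-value cutoff for each combination of settings. Here's a minimal sketch in Python (the thesis code itself is in R). The toy results, the choice of a mean-expression filter, and the names `count_significant` and `cutoff_grid` are assumptions for illustration. In a real RNA-seq workflow the filtering typically happens before the test is run, so the p-values themselves can shift with the filter, which this sketch ignores.

```python
from itertools import product

# Hypothetical per-gene results: (gene id, mean expression, p-value).
toy_results = [
    ("gene_a", 12.0, 0.001),
    ("gene_b", 0.4, 0.004),
    ("gene_c", 3.5, 0.030),
    ("gene_d", 8.1, 0.200),
    ("gene_e", 1.2, 0.008),
]

def count_significant(results, min_expression, p_cutoff):
    """Count genes at or above the expression filter with a p-value below the cutoff."""
    return sum(1 for _, expr, p in results if expr >= min_expression and p < p_cutoff)

def cutoff_grid(results, min_expressions, p_cutoffs):
    """Count of significant genes for every (filter, cutoff) combination."""
    return {
        (m, p): count_significant(results, m, p)
        for m, p in product(min_expressions, p_cutoffs)
    }

grid = cutoff_grid(toy_results, [0.0, 1.0, 5.0], [0.05, 0.01])
for (min_expr, p_cut), n_genes in sorted(grid.items()):
    print(f"min expression {min_expr:>4}, p < {p_cut}: {n_genes} genes")
```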
As a part of the exploratory analysis of the RNA-seq transcriptome data, I investigated the 29 genes that had altered expression patterns in all four brain regions sampled from donors with dementia (hippocampus, forebrain white matter, parietal cortex, and temporal cortex).
Brain Region Intersection Gene Details
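Finding the genes flagged in every region is a set intersection. Here's a minimal sketch in Python, assuming each region's differentially expressed genes have already been collected into a set. The gene IDs below are placeholders, not the 29 genes from the thesis, and `shared_genes` is a hypothetical helper name.

```python
# Placeholder gene IDs per brain region (illustrative only, not thesis results).
toy_hits = {
    "HIP": {"gene_a", "gene_b", "gene_c"},
    "FWM": {"gene_a", "gene_c", "gene_d"},
    "PCx": {"gene_a", "gene_c", "gene_e"},
    "TCx": {"gene_a", "gene_b", "gene_c", "gene_e"},
}

def shared_genes(hits_by_region):
    """Genes that appear in the hit list of every region."""
    gene_sets = list(hits_by_region.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)

shared = sorted(shared_genes(toy_hits))
print(shared)
```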
As things started to wrap up for me, I found myself reflecting on the entire experience of doing the MSPA program. Maybe you stumbled onto this page because you're thinking of pursuing a data science Master's degree. Or maybe you're already in the MSDS program at Northwestern or somewhere else and are trying to make the "thesis or capstone" decision. In this section, I list some of the things I've learned from doing this degree with a focus on doing a thesis project. Just my $0.02. FWIW, etc. I'm putting it down here as a sort of epilogue to the thesis now that she's all done.
“Life can only be understood backwards; but it must be lived forwards.” - Kierkegaard

Feel free to reach out to me if you have any questions!